This report explores a dataset on white wine quality and physicochemical properties. The dataset contains 4,898 observations. The objective is to find which chemical properties influence the quality of white wines.
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
This dataset consists of 13 variables and 4,898 observations. However, the first variable, ‘X’, is a unique identifier with no chemical meaning. I will therefore exclude it and explore only the remaining 12 variables: 11 chemical inputs and 1 output, “quality”.
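Dropping the identifier column could be done as in the minimal sketch below. The small data frame is only a stand-in with a few illustrative rows, and the name `wine_no_id` is an assumption, not the name used in the original analysis.

```r
# Stand-in data frame with a few illustrative rows (not the full dataset).
wine <- data.frame(
  X       = 1:3,
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)

# Keep every column except the row identifier 'X'.
wine_no_id <- wine[, setdiff(names(wine), "X"), drop = FALSE]
names(wine_no_id)
```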
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The distribution of the variable “quality” appears roughly normal. Quality scores range from 3 to 9; most scores are 6 or 5, followed by 7.
## bad average good
## 183 4535 180
The variable “quality.band” is a newly created variable based on the quality score. It is an ordered factor with three possible values: “bad” (quality 3-4), “average” (quality 5-7) and “good” (quality 8-9).
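One way to build such a banded variable is with `cut()`. This is a sketch rather than the code actually used in the report; it assumes scores 3-4 map to “bad”, 5-7 to “average” and 8-9 to “good”, and it runs on a short example vector instead of the real data.

```r
# Example scores covering the full 3-9 range (illustrative only).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)

# cut() uses right-closed intervals by default: (2,4], (4,7], (7,9].
quality.band <- cut(
  quality,
  breaks = c(2, 4, 7, 9),
  labels = c("bad", "average", "good"),
  ordered_result = TRUE
)

# Count how many scores fall into each band.
table(quality.band)
```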
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The variable “fixed.acidity” seems normally distributed, although there are some outliers that are much larger than the median. Therefore, in the histogram I excluded these outliers by limiting the range to (4,10). From the histogram, we can see that most of the wines have a fixed acidity between 6 and 7 g/dm^3.
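Limiting the plotted range can be sketched in base R as follows. The original plots may well have been drawn with a different plotting library; the values and the one-unit bin width here are assumptions chosen for illustration.

```r
# Illustrative values, including one low and one high extreme.
fixed.acidity <- c(3.8, 6.3, 6.8, 6.9, 7.3, 14.2)

# Keep only values inside the open range (4, 10) before binning.
trimmed <- fixed.acidity[fixed.acidity > 4 & fixed.acidity < 10]

# Compute histogram counts without drawing; bins are (4,5], (5,6], ..., (9,10].
h <- hist(trimmed, breaks = seq(4, 10, by = 1), plot = FALSE)
h$counts
```

Note that the extreme values are only hidden from the plot; they remain in `fixed.acidity` itself.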
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The distribution of volatile acidity also seems normal, though slightly positively skewed because of the outliers. Therefore, in the histogram I excluded some outliers by limiting the range to (0,0.6). From the histogram, we can see that most of the wines have a volatile acidity between 0.2 and 0.3 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
By adjusting the binwidth, we can see that within the otherwise normal-looking distribution there is one obvious “unusual” peak at a citric acid value just below 0.5 (after zooming in and subsetting the dataset, I found the value is 0.49).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The original distribution of “residual sugar” is positively skewed. After plotting on a log scale, the distribution appears bimodal, with residual sugar peaking around 2 and again around 9. I wonder what this plot looks like across the different quality scores from 3 to 9. There are a few outliers, as shown in the boxplot.
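The effect of the log scale can be sketched by transforming the values with `log10()` before binning. The vector below is a small made-up sample, not the real data, and the bin edges are an assumption chosen for illustration.

```r
# Made-up, positively skewed sample (illustrative only).
residual.sugar <- c(0.6, 1.2, 1.6, 2.0, 5.2, 8.5, 9.9, 20.7, 65.8)

# On the log10 scale the largest value no longer dominates the range.
log_sugar <- log10(residual.sugar)
range(residual.sugar)
range(log_sugar)

# Bin the transformed values without drawing the plot.
h <- hist(log_sugar, breaks = seq(-0.5, 2, by = 0.5), plot = FALSE)
h$counts
```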
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Most of the wines contain less than 0.05 g/dm^3 chlorides; however, a few wines contain more than 0.1 g/dm^3. There are quite a lot of outliers in the variable “chlorides”. In the histogram, I reduced the range to (0,0.1) to give a better visualisation. With the outliers removed, the distribution appears normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The distribution of free sulfur dioxide appears normal, with a peak around 30-40; that is, the most common level is 30-40 mg/dm^3 free sulfur dioxide. However, the maximum outlier is 289. The range in the histogram is limited to (0,100).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Similar to free sulfur dioxide, the distribution of total sulfur dioxide also appears normal, with some outliers as large as 440. Total sulfur dioxide peaks at around 120.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 78.0 100.0 103.1 125.0 331.0
The variable “bound.sulfur.dioxide” is a newly created variable, calculated as total.sulfur.dioxide minus free.sulfur.dioxide. The objective was to explore the relationship between quality and the different forms of sulfur dioxide. From the plots we can see that the distribution of bound sulfur dioxide appears normal, peaking around 100, which is also the median value.
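The derived column amounts to a single vectorised subtraction. A minimal sketch, using a three-row stand-in data frame rather than the full dataset:

```r
# Stand-in data frame with a few illustrative rows.
wine <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97)
)

# Bound sulfur dioxide is total minus free, computed row by row.
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide
wine$bound.sulfur.dioxide
```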
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The wines’ densities fall within a very narrow range: more than 75% of the wines have a density below 1, and the maximum density is 1.039.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
All of the wines have a pH between 2.7 and 3.9. The pH distribution appears normal, with the median and mean very close to each other.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
75% of the wines have a sulphates level less than 0.55, while the maximum sulphates level is 1.08. We can see that the distribution is a little bit positively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The distribution of the alcohol level appears a little bit positively skewed, with most of the wines’ alcohol percentage between 9% - 11%, while the lowest is 8% and the highest is 14.2%.
The output variable “quality” is an ordered variable ranging from 3 to 9, where 3 is the worst and 9 is the best.
The main features in the dataset are alcohol level, pH and the different forms of acidity. After doing some research on wine quality, I think alcohol level and the different forms of acidity probably contribute most to it.
Density, residual sugar, chlorides and the different forms of sulfur dioxide are also likely to contribute to wine quality.
I created two new variables: quality.band and bound.sulfur.dioxide. The variable quality.band is an ordered factor with three possible values: bad (quality score 3-4), average (quality score 5-7) and good (quality score 8-9).
The variable bound.sulfur.dioxide is total.sulfur.dioxide minus free.sulfur.dioxide. I created it because total sulfur dioxide is the sum of the free and bound forms of sulfur dioxide, and I wanted to see how quality correlates with each of the two forms.
I noticed that there are quite a lot of outliers in these variables, and I wonder whether they have an impact on the quality of the wines. So I added boxplots alongside the histograms to visualise the outliers. However, in the univariate analysis I decided not to remove any data. In the next section, where bivariate plots explore the relationship between the features and wine quality, I will remove some outliers where appropriate.
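The points a boxplot draws as outliers can also be listed directly with `boxplot.stats()`, which flags values lying more than 1.5 times the hinge spread beyond the hinges. This is only a sketch on made-up values; the report itself does not state which rule was used to decide what counts as an outlier.

```r
# Made-up sample with one extreme value (illustrative only).
chlorides <- c(0.036, 0.043, 0.045, 0.049, 0.050, 0.058, 0.346)

# $out holds the points beyond the whiskers (default coef = 1.5).
stats <- boxplot.stats(chlorides)
stats$out

# The data themselves are left untouched; flagged points can be excluded per plot.
kept <- chlorides[!chlorides %in% stats$out]
length(kept)
```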
From the correlation matrix, we can see that alcohol level has the strongest correlation with quality, followed by density.
The correlation matrix also shows some stronger and weaker trends for good-quality wines, including:
More detailed plots follow to explore the relationship between quality and these features.
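Ranking features by the strength of their correlation with quality can be sketched with `cor()`. The data frame below contains made-up values purely to show the mechanics; it does not reproduce the coefficients reported in this analysis.

```r
# Made-up values for illustration only.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.4, 12.8),
  density = c(1.0010, 0.9940, 0.9951, 0.9930, 0.9908, 0.9902),
  quality = c(5, 5, 6, 6, 7, 8)
)

# Pearson correlation of every column with quality, dropping quality itself.
cors <- cor(wine)[, "quality"]
cors <- cors[names(cors) != "quality"]

# Order by absolute value so strong negative correlations rank highly too.
round(cors[order(abs(cors), decreasing = TRUE)], 3)
```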
Correlation Coefficient between alcohol and quality:
## [1] 0.436
Correlation Coefficient between pH and quality:
## [1] 0.099
The correlation coefficient seems low; however, from the box plot of pH across bad, average and good wines, we can see that better-quality wines tend to have a higher median pH.
Correlation Coefficient between volatile.acidity and quality:
## [1] -0.195
From the plots above, we can clearly see that bad-quality wines tend to have higher volatile acidity.
Correlation Coefficient between density and quality:
## [1] -0.307
With some outliers in density removed, the plots above show a clear and relatively strong relationship between density and quality: as density increases, quality decreases.
Correlation Coefficient between chlorides and quality:
## [1] -0.21
From the plots above, we can see that chlorides level decreases as quality score / quality band increases. As there are quite a few outliers, I limited the range of the scatterplot to (0,0.1) to give a clearer visualisation.
Correlation Coefficient between residual.sugar and quality:
## [1] -0.098
The relationship between residual sugar and quality score/band is not very consistent. However, from the boxplot we can still see that too low a level of residual sugar is not good for the quality of the wines.
Correlation Coefficient between total.sulfur.dioxide and quality:
## [1] -0.175
Correlation Coefficient between bound.sulfur.dioxide and quality:
## [1] -0.218
From the plots above of quality against total.sulfur.dioxide and bound.sulfur.dioxide, we can see that the relationship between total.sulfur.dioxide and quality is not as consistent as that between bound.sulfur.dioxide and quality. Generally speaking, however, total and bound sulfur dioxide decrease as quality improves. To create clearer plots, I applied a log10 scale to bound and total sulfur dioxide below.
The plots above show the relationship of some features directly with quality. In the plots below, I explore the relationships between these features themselves.
## [1] "Correlation value of alcohol & density: -0.78"
## [1] "Correlation value of alcohol & residual.sugar: -0.451"
Correlation Coefficient of alcohol vs. chlorides,free.sulfur.dioxide, bound.sulfur.dioxide, total.sulfur.dioxide:
## [1] "Correlation value of alcohol & chlorides: -0.36"
## [1] "Correlation value of alcohol & free.sulfur.dioxide: -0.25"
## [1] "Correlation value of alcohol & bound.sulfur.dioxide: -0.427"
## [1] "Correlation value of alcohol & total.sulfur.dioxide: -0.449"
The plots above show the correlation between alcohol level and other features such as density, residual.sugar, chlorides and the different forms of sulfur dioxide. We can see that alcohol and density have a very strong negative correlation: as alcohol level decreases, density increases. The other features are also negatively correlated with alcohol, so alcohol level tends to be lower as they increase.
Correlation Coefficient of pH vs. fixed.acidity, citric.acid, volatile.acidity:
## [1] "Correlation value of pH & fixed.acidity: -0.426"
## [1] "Correlation value of pH & citric.acid: -0.164"
## [1] "Correlation value of pH & volatile.acidity: -0.032"
The above plots show the relationship between pH and the different forms of acidity. It is clear that pH correlates much more strongly with fixed acidity (-0.426) than with the other forms of acidity (citric acid, volatile acidity).
Correlation Coefficient of density vs. fixed.acidity, residual.sugar, total.sulfur.dioxide, alcohol:
## [1] "Correlation value of density & fixed.acidity: 0.265"
## [1] "Correlation value of density & residual.sugar: 0.839"
## [1] "Correlation value of density & total.sulfur.dioxide: 0.53"
## [1] "Correlation value of density & alcohol: -0.78"
Density has a very strong relationship with residual sugar (0.839) and a moderate one with total sulfur dioxide (0.53), followed by fixed acidity (0.265). As already mentioned in the previous analysis, density and alcohol have a very strong negative correlation: as one increases, the other decreases.
Correlation Coefficient of total.sulfur.dioxide vs. residual.sugar, chlorides, bound.sulfur.dioxide, free.sulfur.dioxide:
## [1] "Correlation value of total.sulfur.dioxide & residual.sugar: 0.401"
## [1] "Correlation value of total.sulfur.dioxide & chlorides: 0.199"
## [1] "Correlation value of total.sulfur.dioxide & bound.sulfur.dioxide: 0.922"
## [1] "Correlation value of total.sulfur.dioxide & free.sulfur.dioxide: 0.616"
As total sulfur dioxide is the sum of bound and free sulfur dioxide, we can see the strong correlation between them. Other than that, total.sulfur.dioxide and residual.sugar appear to strengthen each other, as do bound.sulfur.dioxide and chlorides, though to a lesser degree.
Correlation Coefficient of citric.acid vs. fixed.acidity, volatile.acidity:
## [1] "Correlation value of fixed.acidity & citric.acid: 0.289"
## [1] "Correlation value of volatile.acidity & citric.acid: -0.149"
It is interesting to see the relationship between the three different forms of acidity. The features citric.acid and fixed.acidity appear to strengthen each other, while citric.acid and volatile.acidity appear to weaken each other.
Density seems to have strong relationships with residual sugar and total sulfur dioxide.
The strongest relationships are between pH and fixed acidity, and between total sulfur dioxide and bound sulfur dioxide. And as mentioned above, density has a strong relationship with alcohol and residual sugar.
Correlation Coefficient of chlorides & alcohol across quality:
## [1] "Quality 3 : -0.353"
## [1] "Quality 4 : -0.387"
## [1] "Quality 5 : -0.223"
## [1] "Quality 6 : -0.32"
## [1] "Quality 7 : -0.555"
## [1] "Quality 8 : -0.512"
## [1] "Quality 9 : -0.51"
Correlation Coefficient of chlorides & alcohol across quality.band:
## [1] "Quality Band bad : -0.371"
## [1] "Quality Band average : -0.351"
## [1] "Quality Band good : -0.516"
Alcohol and chlorides are two strong features that influence the quality of white wines. The first plot above shows the distribution of chlorides and alcohol across the quality scores. The second plot filters out some outliers and the average-quality wines to show a clearer trend.
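Per-group coefficients like those listed above can be computed by splitting the data on the grouping variable and calling `cor()` within each group. The sketch below uses made-up values and only two bands; it is one possible approach, not necessarily the code behind the printed output.

```r
# Made-up values for two quality bands (illustrative only).
wine <- data.frame(
  quality.band = factor(rep(c("bad", "good"), each = 4), levels = c("bad", "good")),
  alcohol      = c(9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5),
  chlorides    = c(0.060, 0.055, 0.058, 0.050, 0.040, 0.036, 0.037, 0.030)
)

# One Pearson coefficient per band: split the rows, then correlate within each piece.
by_band <- sapply(
  split(wine, wine$quality.band),
  function(d) cor(d$alcohol, d$chlorides)
)
round(by_band, 3)
```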
Correlation Coefficient of citric.acid & fixed.acidity across quality:
## [1] "Quality 3 : 0.337"
## [1] "Quality 4 : 0.515"
## [1] "Quality 5 : 0.294"
## [1] "Quality 6 : 0.281"
## [1] "Quality 7 : 0.266"
## [1] "Quality 8 : 0.186"
## [1] "Quality 9 : 0.55"
Correlation Coefficient of citric.acid & fixed.acidity across quality.band:
## [1] "Quality Band bad : 0.476"
## [1] "Quality Band average : 0.283"
## [1] "Quality Band good : 0.209"
The relationship between fixed.acidity and citric.acid becomes weaker as the quality improves.
From the second plot above, we can see that good wines tend to have lower fixed acidity; also the citric acid of good wines tends to have a smaller variance.
Correlation Coefficient of pH & fixed.acidity across quality:
## [1] "Quality 3 : -0.755"
## [1] "Quality 4 : -0.466"
## [1] "Quality 5 : -0.427"
## [1] "Quality 6 : -0.379"
## [1] "Quality 7 : -0.492"
## [1] "Quality 8 : -0.476"
## [1] "Quality 9 : -0.828"
Correlation Coefficient of pH & fixed.acidity across quality.band:
## [1] "Quality Band bad : -0.514"
## [1] "Quality Band average : -0.419"
## [1] "Quality Band good : -0.456"
The relationship between pH and fixed.acidity is consistently strong across the different quality scores and bands.
Correlation Coefficient of total.sulfur.dioxide & bound.sulfur.dioxide across quality.band:
## [1] "Quality Band bad : 0.879"
## [1] "Quality Band average : 0.928"
## [1] "Quality Band good : 0.874"
The relationship between total.sulfur.dioxide and bound.sulfur.dioxide is consistently strong across the different quality bands.
Correlation Coefficient of total.sulfur.dioxide & free.sulfur.dioxide across quality.band:
## [1] "Quality Band bad : 0.708"
## [1] "Quality Band average : 0.609"
## [1] "Quality Band good : 0.616"
As above, the relationship between total.sulfur.dioxide and free.sulfur.dioxide is also consistently strong across the different quality bands.
Correlation Coefficient of fixed.acidity & volatile.acidity across quality.band:
## [1] "Quality Band bad : -0.047"
## [1] "Quality Band average : -0.033"
## [1] "Quality Band good : -0.127"
The correlation between fixed and volatile acidity is very weak. But from the plots above, we can see that, compared to bad-quality wines, good-quality wines appear to have lower levels of fixed acidity and volatile acidity. Many more outliers in fixed and volatile acidity can be found among the bad-quality wines.
Correlation Coefficient of residual.sugar & density across quality.band:
## [1] "Quality Band bad : 0.741"
## [1] "Quality Band average : 0.848"
## [1] "Quality Band good : 0.821"
The correlation between density and residual sugar is consistently strong. As mentioned before, good wines tend to have a lower density, which goes together with a lower level of residual sugar.
In the multivariate plots section, I explored the following features across quality and quality band: density vs. residual sugar, volatile acidity vs. fixed acidity, free.sulfur.dioxide vs. total.sulfur.dioxide, pH vs. fixed acidity, alcohol, chlorides, citric acid.
It is interesting that citric acid seems to have some relationship with volatile acidity. They are different forms of acid in the wine, but they seem to weaken each other, though the relationship is not strong.
No
This set of box plots illustrates the effect of alcohol level on white wine quality. Generally speaking, better-quality wines tend to have a higher alcohol level. However, wines with a quality score of 5 have a lower alcohol level than wines scoring 3 and 4.
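The comparison shown by the box plots can also be checked numerically by taking the median alcohol level per quality score. The following is a sketch with made-up values for three scores only, so the numbers are not those of the real dataset.

```r
# Made-up values for three quality scores (illustrative only).
wine <- data.frame(
  quality = c(3, 3, 5, 5, 7, 7),
  alcohol = c(10.2, 10.6, 9.4, 9.8, 11.2, 11.6)
)

# Median alcohol within each quality score; names are the scores.
med <- tapply(wine$alcohol, wine$quality, median)
med
```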
By removing the outliers, filtering out the average-quality wines, this plot clearly demonstrates the relationship between density and residual sugar. Across the different qualities, as the residual sugar level increases, the density increases. This plot also shows the trend that good-quality wines seem to have smaller density and lower level of residual sugar.
This plot shows the strong correlation between pH and fixed acidity across bad and good wines. As the fixed acidity level increases, the pH value decreases. This makes sense because pH is a numeric scale used to specify acidity: below 7, the smaller the pH value, the more acidic the solution. The plot also shows that a pH that is too low (which probably means too much acidity) is not good for wine quality.
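The direction of the pH scale follows from its definition as the negative base-10 logarithm of the hydrogen ion activity, approximated here by concentration in mol/L. A small sketch, with the variable name `hydrogen_ion` and the two concentrations chosen purely for illustration:

```r
# Two illustrative hydrogen ion concentrations in mol/L.
hydrogen_ion <- c(1e-3, 1e-4)

# pH is the negative base-10 logarithm: ten times more ions lowers pH by 1.
pH <- -log10(hydrogen_ion)
pH
```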
Through this exploratory data analysis of the white wine dataset, I identified the key factors that influence the quality of the wines, including alcohol percentage, pH/acidity, density and chlorides.
At the beginning of the analysis, I struggled because the correlations between the variables in this dataset are generally weak, with only a few relatively stronger relationships. To work around this, I first found the strongest features for quality, which are alcohol and density. I then looked for the strongest features for alcohol, which include residual.sugar, chlorides and the different forms of sulfur dioxide. The same exploration was done for density, the second strongest feature for quality; its strongest features include residual.sugar and the different forms of sulfur dioxide. In this way, I tried to find different levels of connection between the variables.
However, as the quality score is measured subjectively by wine experts, I believe the correlations among the factors mentioned above are within reasonable bounds. Further statistical study is suggested to confirm these hypotheses quantitatively.